feat: create pdf parser with unstructured #8

ghoullishly · 2025-03-05T17:06:06Z

Fixes #4: Experimented with UnstructuredLoader for PDF parsing. Its output includes element_id, parent_id and document type metadata, which is very useful. The coordinate metadata could be used to locate images. It can (somewhat) handle math formulas, which may be relevant if/when we expand to other subjects.
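To illustrate why element_id and parent_id are useful, here is a minimal sketch that groups parsed elements under their parent. The record layout (a dict with "content" and "metadata" keys) and the sample values are assumptions for illustration only, not output captured from the parser.

```python
from collections import defaultdict


def group_by_parent(elements: list[dict]) -> dict:
    """Map each parent_id to the element_ids of its children.

    Elements without a parent_id are collected under the key None.
    """
    children = defaultdict(list)
    for element in elements:
        meta = element["metadata"]
        children[meta.get("parent_id")].append(meta["element_id"])
    return dict(children)


# Hypothetical sample records (assumed layout, made-up ids).
sample = [
    {"content": "Chapter 1", "metadata": {"element_id": "a1"}},
    {"content": "First paragraph", "metadata": {"element_id": "b2", "parent_id": "a1"}},
    {"content": "Second paragraph", "metadata": {"element_id": "c3", "parent_id": "a1"}},
]

hierarchy = group_by_parent(sample)
print(hierarchy)
```

A mapping like this could later help keep a heading together with its body text when chunking.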

alvaro-mazcu · 2025-03-20T22:21:18Z

src/langchain_chunking.py

+import re
+import pathlib
+
+pdf_path = r"C:\Users\ADMIN\Desktop\KTHAIS\twiga-warehouse\data\parsed\text.md"


The path should be relative to the project root, so that it resolves correctly on every developer's machine instead of depending on one person's absolute path.
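One possible way to do this, sketched with pathlib. It assumes the script lives in src/ directly under the project root and that the data stays in data/parsed, as in the diff; the helper name resolve_in_project is hypothetical.

```python
from pathlib import Path


def resolve_in_project(script_file: str, *parts: str) -> Path:
    """Build a path relative to the project root.

    Assumption: the calling script sits in src/, one level below the root.
    """
    project_root = Path(script_file).resolve().parent.parent
    return project_root.joinpath(*parts)


# Inside src/langchain_chunking.py this would be called as:
#   pdf_path = resolve_in_project(__file__, "data", "parsed", "text.md")
example = resolve_in_project("/tmp/proj/src/langchain_chunking.py", "data", "parsed", "text.md")
print(example)
```

Because the root is derived from the script's own location, the result does not depend on the current working directory.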

alvaro-mazcu · 2025-03-20T22:24:08Z

src/langchain_chunking.py

+    ("#####", "Header 4"), # Bold + italic
+]
+
+def preprocess_md(md_doc):


Can you add type hints for the inputs and the output? Example:

    def foo(im_a_number: int) -> int:
        return im_a_number + 1

Same for the other functions.
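Applied to preprocess_md, this could look like the sketch below. Only the signature is the point; the body is a placeholder assumption (it just collapses runs of blank lines), since the real implementation is not shown in this hunk.

```python
import re


def preprocess_md(md_doc: str) -> str:
    """Return a cleaned copy of the Markdown text.

    Placeholder body (assumption): collapse three or more consecutive
    newlines into a single blank line.
    """
    return re.sub(r"\n{3,}", "\n\n", md_doc)


cleaned = preprocess_md("# Title\n\n\n\nBody text")
print(repr(cleaned))
```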

alvaro-mazcu · 2025-03-20T22:26:17Z

src/langchain_chunking.py

+    chunk_size = 250
+    chunk_overlap = 30


It might be worth exposing these two values as function parameters so that we can tune them if needed. Example:

    def recursive_split(md_header_splits, chunk_size: int = 250, chunk_overlap: int = 30):
        """Split document recursively."""
        ...
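To show how the defaults and overrides would behave, here is a self-contained sketch. It is not the recursive splitter from the PR: a plain fixed-size sliding window over a string stands in for it, and the function name split_with_overlap is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 250, chunk_overlap: int = 30) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters. This is a simple
    sliding window used as a stand-in, not a recursive splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


# Defaults are used unless the caller overrides them.
default_chunks = split_with_overlap("x" * 600)
small_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(len(default_chunks), small_chunks)  # 3 ['abcd', 'defg', 'ghij']
```

Keeping 250 and 30 as defaults preserves the current behavior while letting experiments pass other values.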

alvaro-mazcu · 2025-03-20T22:28:39Z

src/langchain_chunking.py

+md_doc = preprocess_md(md_doc)
+
+# Split data
+md_header_splits = md_split(md_doc)
+#character_splits = recursive_split(md_header_splits)
+
+# Append metadata
+splits_data = []
+for split in md_header_splits:
+    splits_data.append({
+        "content": split.page_content,
+        "metadata": split.metadata
+    })
+
+# Save JSON output
+output_dir = pathlib.Path(r"C:\Users\ADMIN\Desktop\KTHAIS\twiga-warehouse\data\parsed")
+output_dir.mkdir(exist_ok=True)
+
+output_json_path = output_dir / "text.json"
+with open(output_json_path, "w", encoding="utf-8") as json_file:
+    json.dump(splits_data, json_file, ensure_ascii=False, indent=4)
+
+print(f"Markdown splits saved to {output_json_path}")


Can you move this inside a main function? That also means adding the usual if __name__ == "__main__": guard at the bottom of the file. Same for the other files.
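A rough sketch of that structure, under a few assumptions: the helper name save_splits is hypothetical, the records in main are placeholders for what the script builds from md_header_splits, and the output directory is taken relative to the working directory.

```python
import json
from pathlib import Path


def save_splits(splits_data: list[dict], output_dir: Path) -> Path:
    """Write the split records to text.json in output_dir and return its path."""
    output_dir.mkdir(parents=True, exist_ok=True)
    output_json_path = output_dir / "text.json"
    with open(output_json_path, "w", encoding="utf-8") as json_file:
        json.dump(splits_data, json_file, ensure_ascii=False, indent=4)
    return output_json_path


def main(output_dir: Path = Path("data") / "parsed") -> None:
    """Entry point: build the records and save them as JSON."""
    # Placeholder records (assumption); the real script would derive
    # these from md_header_splits.
    splits_data = [{"content": "example text", "metadata": {"Header 1": "Intro"}}]
    output_json_path = save_splits(splits_data, output_dir)
    print(f"Markdown splits saved to {output_json_path}")


if __name__ == "__main__":
    main()
```

With the guard in place, importing the module (for example from a test) no longer triggers parsing or file writes.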

feat: create pdf parser with unstructured

f74a71e

ghoullishly requested a review from alvaro-mazcu March 5, 2025 17:06

Fixes issue #5: Test pymupdf4llm for parsing and langchain for chunking

5524150

alvaro-mazcu reviewed Mar 20, 2025
